Internal Clustering Evaluation of Data Streams
Abstract
Clustering validation is a crucial part of choosing the clustering algorithm that performs best on a given input data set. Internal clustering validation is efficient and realistic, whereas external validation requires a ground truth, which is not available in most applications. In this paper, we analyze the properties and performance of eleven internal clustering measures. In particular, as the importance of streaming data grows, we apply these measures to carefully synthesized stream scenarios to reveal how they react to clusterings on evolving data streams. A series of experiments shows that, unlike the case with static data, the Calinski-Harabasz index performs best in coping with common aspects and errors of stream clustering.
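For readers unfamiliar with the Calinski-Harabasz index named above, the following is a minimal sketch of its standard textbook definition: the ratio of between-cluster dispersion to within-cluster dispersion, each normalized by its degrees of freedom. The function name, the toy points, and the choice to score a single fixed batch are illustrative assumptions; the abstract does not describe how the paper applies the measure to an evolving stream.

```python
from collections import defaultdict


def calinski_harabasz(points, labels):
    """Standard Calinski-Harabasz index for one batch of labeled points.

    Illustrative helper (not taken from the paper): higher values mean
    compact clusters that lie far from the overall mean.
    """
    n = len(points)
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need 2 <= number of clusters < number of points")

    def mean(rows):
        # Coordinate-wise mean of a list of equal-length tuples.
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        # Squared Euclidean distance.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(points)
    between = 0.0  # size-weighted dispersion of centroids around the overall mean
    within = 0.0   # dispersion of points around their own centroid
    for members in clusters.values():
        centroid = mean(members)
        between += len(members) * sq_dist(centroid, overall)
        within += sum(sq_dist(p, centroid) for p in members)

    if within == 0.0:
        return float("inf")  # every point coincides with its centroid
    return (between / (k - 1)) / (within / (n - k))


# Toy data: two tight groups far apart along the first axis.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = calinski_harabasz(points, [0, 0, 1, 1])  # matches the true groups
bad = calinski_harabasz(points, [0, 1, 0, 1])   # splits each true group
print(good > bad)
```

In a streaming setting one plausible use, again an assumption rather than the paper's protocol, is to recompute such a score on the most recent window of points and compare it across candidate clusterings.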
Similar resources
Benchmarking Stream Clustering Algorithms within the MOA Framework
In today’s applications, massive, evolving data streams are ubiquitous. To gain useful information from this data, real-time clustering analysis for streams is needed. A multitude of stream clustering algorithms have been introduced. However, assessing the effectiveness of such an algorithm is challenging, because up to now there has been no tool that allows a direct comparison of these algorithms. We pre...
Effective Evaluation Measures for Subspace Clustering of Data Streams
Nowadays, most streaming data sources are becoming high-dimensional. Accordingly, subspace stream clustering, which aims at finding evolving clusters within subgroups of dimensions, has gained significant importance. However, existing subspace clustering evaluation measures are mainly designed for static data and cannot reflect the quality of the evolving nature of data streams. On the other ...
A New Mathematical Model for the Prediction of Internal Recirculation in Impinging Streams Reactors
A mathematical model for the prediction of internal recirculation of complex impinging stream reactors has been presented. The model constitutes a repetition of a series of ideal plug flow reactors and CSTR reactors with recirculation. The simplicity of the repeating motif allows for the derivation of an algebraic relation of the whole system using the Laplace transform. An impinging stream...
Adaptive Mining Techniques for Data Streams using Algorithm Output Granularity
Mining data streams is an emerging area of research given the potentially large number of business and scientific applications. A significant challenge in analyzing/mining data streams is the high data rate of the stream. In this paper, we propose a novel approach to cope with the high data rate of incoming data streams. We termed our approach “algorithm output granularity”. It is a resource-aw...
Divisive clustering of high dimensional data streams
Clustering streaming data is gaining importance as automatic data acquisition technologies are deployed in diverse applications. We propose a fully incremental projected divisive clustering method for high-dimensional data streams that is motivated by high density clustering. The method is capable of identifying clusters in arbitrary subspaces, estimating the number of clusters, and detecting c...